There is a game where I continue to flip a coin until I get heads. If the first time I get heads is on the nth flip, then I pay you 2n - 1 dollars. You can assume that this is a fair coin. Create a program to show how much you should pay me to play this game. Please make sure to include your results after running 1,000, 10,000, and 100,000 trials.
This scenario can be modeled using the negative binomial distribution. If n is the number of trials until the first heads, then the expected value of n is given by r / p, where r is the number of heads and p is the probability of heads on any given trial.
In this problem we are interested in the first occurrence of heads, so we can denote r = 1, and the coin is fair, so the probability of success p = .5. Therefore, the expected value of n is 1 / .5 = 2.
The payout associated with n trials is 2n - 1, so by linearity of expectation the average payout is E[2n - 1] = 2E[n] - 1 = 2 * 2 - 1 = $3. We expect to see this parametric result through simulation as well.
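Before simulating, this expectation can also be checked analytically, since for a fair coin P(first heads on flip n) = 0.5^n; a quick base-R sketch:

```r
# Analytic check: P(first heads on flip n) = 0.5^n for a fair coin
n <- 1:60                         # truncate the infinite sum; the tail is negligible
expected_payout <- sum((2 * n - 1) * 0.5^n)
expected_payout                   # converges to 2 * E[n] - 1 = 3
```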
set.seed(1)
library(parallel) # for `mclapply()` and `detectCores()`
# Define a function to run the game
game = function(.) # dot is a placeholder; this fn ignores its argument
{
# Run one iteration of the game to generate a payout.
n <- 1 # start w/ the first trial
# Flip the coin. While it is not heads, increment n and flip again
while (sample(c(T, F), size = 1)) {
# Note, since the probability of heads and tails are equal,
# we don't need to specify if T or F is "heads" or "tails"
n = n + 1 # record that we will initiate the n+1-th trial next
}
return(2*n - 1) # return the payout
}
# Define a function to run the game i times, returning the vector of payouts
payouts = function(i) {
# Given a number of iterations to run, implement the game and return a vector
# containing the payouts associated with each iteration.
# i is numeric: the number of iterations of the game
# returns a numeric vector of length i containing the payout from each iteration
require(parallel)
# for each iteration i, play the game
# Thus, use mclapply to play the game in parallel i times
# Use detectCores() to detect how many cores this machine has
# Since I know I am running this on my personal 4-core machine, I am willing
# to use all cores. On shared resources, leave at least one core free (detectCores() - 1)
# Note: use unlist so that the result is a vector, not a list
# return the i-length vector of payouts
return(unlist(mclapply(X = 1:i, FUN = game, mc.cores = detectCores())))
}
# First simulation
t0 = Sys.time()
payout1000 = payouts(1000)
(t1000 = Sys.time() - t0) # record the time to run
## Time difference of 0.05900812 secs
# Next simulation
t0 = Sys.time()
payout10000 = payouts(10000)
(t10000 = Sys.time() - t0)
## Time difference of 0.603652 secs
# Final simulation
t0 = Sys.time()
payout100000 = payouts(100000)
(t100000 = Sys.time() - t0)
## Time difference of 1.578739 secs
payouts = list(payout1000, payout10000, payout100000)
# On average, the mean payout is
sapply(payouts, mean)
## [1] 2.9320 3.0362 3.0023
As a result, I would not pay more than $2.99 to play the game; at $2.99 per game, playing 100,000 times is expected to yield a greater cumulative profit than playing 10,000 or 1,000 times.
rm(list=ls())
library(lubridate)
library(knitr)
library(viridis)
library(lme4)
library(lmerTest)
library(kableExtra)
library(MLmetrics)
library(caret)
library(tseries)
library(forecast)
#library(tidymodels)
suppressPackageStartupMessages(library(tidyverse))
set.seed(1)
df <- read_tsv("../data/Question2.tsv")
print(dim(df))
## [1] 981 3
kbl(head(df), 'html', align = 'r') %>% kable_styling()
| Date | Cost | Revenue |
|---|---|---|
| 2019-08-20 | $4,214.69 | $14,349.50 |
| 2019-08-21 | $2,627.13 | $13,463.13 |
| 2019-08-22 | $3,196.08 | $10,297.64 |
| 2019-08-23 | $1,134.45 | $6,296.62 |
| 2019-08-24 | $3,207.85 | $10,291.51 |
| 2019-08-25 | $7,325.59 | $24,733.41 |
NA and duplicates
df %>%
summarise(across(everything(), list(zeros = ~sum(is.na(.)))),
duplicated.days = sum(duplicated(Date))) %>%
kbl('html') %>% kable_styling()
| Date_zeros | Cost_zeros | Revenue_zeros | duplicated.days |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
There is one duplicated day, and there are no NAs
df %>%
filter(Date == df$Date[duplicated(df$Date)]) %>%
kbl('html') %>% kable_styling()
| Date | Cost | Revenue |
|---|---|---|
| 2021-09-03 | $5,432.35 | $20,120.06 |
| 2021-09-03 | $5,432.35 | $20,120.06 |
The above shows that the corresponding Cost and Revenue are duplicated for the duplicated date, so just delete one instance.
df = df %>% filter(!duplicated(Date))
Convert Cost and Revenue, which read in as character/string data, to numeric values. Extract day, month, and year, as well as “year day” and “month day”, from the Date column to aid in visualization and possibly modeling. Create a Profit variable, estimated as \(Revenue - Cost\).
df1 = df %>%
mutate(
# convert the cost and revenue columns from character strings of the format
# $ [n-digit dollars] . [two-digit cents]
across(c(Cost, Revenue), ~{
dollars = as.numeric(gsub('$|[[:punct:]]', '', substr(., 1, nchar(.) - 2)))
cents = substr(., nchar(.) - 1, nchar(.))
as.numeric(paste(dollars, cents, sep = '.'))
}),
# From the date column, it may be handy to have the day, the month, year,
# the day within year (from 0 - 365 excluding leap years), day w/in month
day = day(Date), month = month(Date), year = factor(year(Date)),
yday = yday(Date), mday=mday(Date), # day w/in year and month
# For visualization it can be helpful to have just the month-day (no year)
month_and_day = as.factor(sub('.{5}', '', Date)),
# Code the variable profit, estimated by Revenue - cost
#Profit = Revenue - Cost, efficiency = Revenue / Cost)
Profit = Revenue - Cost)
head(df1) %>%
kbl('html') %>% kable_styling()
| Date | Cost | Revenue | day | month | year | yday | mday | month_and_day | Profit |
|---|---|---|---|---|---|---|---|---|---|
| 2019-08-20 | 4214.69 | 14349.50 | 20 | 8 | 2019 | 232 | 20 | 08-20 | 10134.81 |
| 2019-08-21 | 2627.13 | 13463.13 | 21 | 8 | 2019 | 233 | 21 | 08-21 | 10836.00 |
| 2019-08-22 | 3196.08 | 10297.64 | 22 | 8 | 2019 | 234 | 22 | 08-22 | 7101.56 |
| 2019-08-23 | 1134.45 | 6296.62 | 23 | 8 | 2019 | 235 | 23 | 08-23 | 5162.17 |
| 2019-08-24 | 3207.85 | 10291.51 | 24 | 8 | 2019 | 236 | 24 | 08-24 | 7083.66 |
| 2019-08-25 | 7325.59 | 24733.41 | 25 | 8 | 2019 | 237 | 25 | 08-25 | 17407.82 |
Revenue
ggplot(df1) +
geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
theme_minimal() +
scale_fill_viridis_d() +
facet_grid(rows = vars(year))
This variable shows a high degree of positive skew that is pretty consistent across years. Let’s view it on a log scale:
ggplot(df1) +
geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
theme_minimal() +
scale_fill_viridis_d() +
facet_grid(rows = vars(year)) +
scale_x_log10()
Transforming the axes (and implicitly visualizing
log(Revenue)) reveals a trend across years, and also
normalizes the data fairly well. Overall result on the variable after
log-transformation:
df1 %>%
ggplot +
geom_density(aes(x = log(Revenue)),
fill='skyblue', color='black', alpha = .5, linewidth=.5) +
theme_minimal()
# Format data for plotting
plt.dat = df1 %>%
mutate(`log(Revenue)` = log(Revenue), `log(Profit)` = log(Profit))
# plot of: Revenue (log transformed)
plt.dat %>%
ggplot +
geom_line(aes(x = yday, y = `log(Revenue)`, color = year, group = year)) +
# Label just the first day of each month:
scale_x_continuous(
name = 'Month',
breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
labels = month(1:12, label = T)
) +
# set relevant y-axis
scale_y_continuous(limits = c(8,12.25)) +
theme_minimal()
# plot of: Profit = Revenue - Cost (log transformed)
plt.dat %>%
ggplot +
geom_line(aes(x = yday, y = `log(Profit)`, color = year, group = year)) +
# Label just the first day of each month:
scale_x_continuous(
name = 'Month',
breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
labels = month(1:12, label = T)
) +
# set relevant y-axis
scale_y_continuous(limits = c(8,12.25)) +
theme_minimal()
Of the years with complete data, 2021 shows the most revenue, and there appears to be a yearly trend such that 2019 < 2020 < 2021 < 2022.
Revenue (in log units) appears to increase approximately monotonically from January to the holiday season, such that each month is higher on average than the previous. However, a sharp decline starting in mid-December makes this rule inaccurate overall.
2020 appears to be an outlier relative to other years. It shows much more variability in the slow-frequency trend, with surprisingly high revenue from January to March, followed by a large drop and then a big rebound. The dip and rebound may be due to the start of the COVID-19 pandemic, or to noise.
Cost and Revenue
ggplot(df1, aes(x = log(Cost), y = log(Revenue))) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y~x') +
facet_wrap(~year,scales = 'fixed') +
theme_minimal()
This graph shows that despite the long positive skew in the distributions of Cost and Revenue, the relationship between Cost and Revenue is reliable even at extreme values of both. There appears to be a strong correlation between Cost and Revenue, and the relationship between log(Cost) and log(Revenue) appears to be linear. As such, it appears appropriate to model log(Revenue) going forward. Finally, this trend appears consistent across years; however, albeit based on fewer data points, Cost appears to be less effective at generating Revenue in 2022.
df1 %>%
pivot_longer(cols = c('Revenue', 'Cost'), names_to = 'name', values_to = 'value') %>%
ggplot +
geom_line(aes(yday, y = log(value), color = name, group = name)) +
scale_x_continuous(
name = 'Month',
breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
labels = month(1:12, label = T),
) +
scale_y_continuous(limits = c(6.5,12.25)) +
facet_grid(rows = vars(year)) +
theme_bw()
(ct = cor.test(df1$Cost, df1$Revenue, method = 'spearman'))
##
## Spearman's rank correlation rho
##
## data: df1$Cost and df1$Revenue
## S = 24072006, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8465433
The dynamic coupling between Cost and Revenue is significant, \(\rho_{spearman}\) = 0.85, p < .0001. Furthermore, Cost seems to predict both short-term fluctuations and long-term trends in Revenue, such that seasonal trends may be redundant.
df1 %>%
ggplot +
geom_line(aes(x = Date, y = log(Revenue))) +
theme_bw()
The plot suggests that a yearly seasonal trend arises, which resets mid-December each year.
adf.test(df1$Revenue)
## Warning in adf.test(df1$Revenue): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: df1$Revenue
## Dickey-Fuller = -4.1153, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary
This suggests the time series is stationary and can be appropriately modeled using an ARIMA-like model (ARIMA or ARIMAX).
Conclusions from EDA and graphical analysis
df2 = df1 %>%
# Log transform Revenue and Cost
mutate(log.Revenue = log(Revenue),
log.Cost = log(Cost))
# Split December into two parts, before and after the peak (for that year)
for (y in unique(df2$year)) {
dec.data = df2 %>% filter(month == 12, year == y)
if (nrow(dec.data) > 0) { # if this year has december data
dec.data[-1:-which.max(dec.data$Cost), 'month'] = 13
df2[(df2$year == y) & (df2$month == 12), 'month'] = dec.data$month
}
}
# Store month as an ordered factor
df2 = df2 %>% mutate(across(month, ordered))
df3 = df2 %>%
# To aid interpretability (and since we're not restricted to OLS), keep both
# the raw and log-transformed versions
select(log.Revenue, log.Cost, Revenue, Cost) %>%
ts # convert to time series using ts()
head(df3)
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## log.Revenue log.Cost Revenue Cost
## 1 9.571470 8.346331 14349.50 4214.69
## 2 9.507710 7.873647 13463.13 2627.13
## 3 9.239670 8.069680 10297.64 3196.08
## 4 8.747768 7.033903 6296.62 1134.45
## 5 9.239075 8.073356 10291.51 3207.85
## 6 10.115910 8.899129 24733.41 7325.59
Note: the first model I fit used stationary = TRUE, implicitly fitting a stationary ARMA model, but the residuals were auto-correlated. The non-stationary (differenced) ARIMA model held better to the assumption that residuals are independent. Finally, evidence for independence of residuals was strongest when using raw Revenue and Cost, rather than their log transformations.
mod <- auto.arima(df3[, "Revenue"], xreg = df3[, "Cost"], stationary = F)
# Check model fit
mod
## Series: df3[, "Revenue"]
## Regression with ARIMA(1,1,2) errors
##
## Coefficients:
## ar1 ma1 ma2 xreg
## -0.6280 0.0417 -0.4760 3.7673
## s.e. 0.1358 0.1291 0.0758 0.0744
##
## sigma^2 = 32821109: log likelihood = -9858.98
## AIC=19727.96 AICc=19728.02 BIC=19752.39
The best-fitting model suggests autocorrelation and non-stationarity, as well as significant explanatory/predictive value of Cost.
plt.dat %>%
mutate(predicted = mod$fitted) %>%
pivot_longer(cols = c('Revenue', 'predicted'), names_to = 'name', values_to = 'value') %>%
ggplot +
geom_line(aes(yday, y = log(value), color = name, group = name)) +
scale_x_continuous(
name = 'Month',
breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
labels = month(1:12, label = T),
) +
scale_y_continuous(limits = c(6.5,12.25)) +
facet_grid(rows = vars(year)) +
theme_bw()
The model appears to fit very well with just Cost as a predictor.
checkresiduals(mod)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,1,2) errors
## Q* = 11.67, df = 7, p-value = 0.1119
##
## Model df: 3. Total lags used: 10
On the first day of the week we spend $17,007.92. How would you allocate the remaining budget throughout the week to try to maximize the amount of Revenue that we get for that channel?
budget = 116000 - 17007.92
new_data = tibble(
uniform = rep(1/6, times = 6),
increasing = c(.05, .05, .1, .15, .3, .35),
decreasing = rev(increasing),
max1 = ifelse(1:6 == 1, .80, (1 - .80)/5),
max2 = ifelse(1:6 == 2, .80, (1 - .80)/5), # one day takes majority of expenditure
max3 = ifelse(1:6 == 3, .80, (1 - .80)/5),
max4 = ifelse(1:6 == 4, .80, (1 - .80)/5),
max5 = ifelse(1:6 == 5, .80, (1 - .80)/5),
max6 = ifelse(1:6 == 6, .80, (1 - .80)/5)
)
all(sapply(new_data, sum) == 1) # the allocation proportions should sum to 1
## [1] TRUE
# Different expenditure patterns:
(new_data = new_data %>% mutate(across(everything(), ~.*budget)))
## # A tibble: 6 × 9
## uniform increasing decreasing max1 max2 max3 max4 max5 max6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 16499. 4950. 34647. 79194. 3960. 3960. 3960. 3960. 3960.
## 2 16499. 4950. 29698. 3960. 79194. 3960. 3960. 3960. 3960.
## 3 16499. 9899. 14849. 3960. 3960. 79194. 3960. 3960. 3960.
## 4 16499. 14849. 9899. 3960. 3960. 3960. 79194. 3960. 3960.
## 5 16499. 29698. 4950. 3960. 3960. 3960. 3960. 79194. 3960.
## 6 16499. 34647. 4950. 3960. 3960. 3960. 3960. 3960. 79194.
The plots below show the forecasts and report the cumulative revenue for the forecasted week.
predictions = map(new_data, ~{forecast(mod, xreg = .x)})
map(1:length(predictions), ~{
p = predictions[[.x]]
cond = names(new_data)[.x]
revenue = round(sum(as.vector(p$mean)), 2)
plt.t = paste0('Total Forecasted Revenue from 4/26/2022 to 5/1/2022
condition: ', cond, '; revenue: ', revenue)
autoplot(p) +
coord_cartesian(xlim = c(nrow(df3) - 90, nrow(df3) + 6)) +
ggtitle(plt.t) +
theme_minimal()
})
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
lapply(predictions[-1:-3], function(x) x$mean) %>%
as.data.frame %>%
as.matrix %>%
diag %>%
{data.frame(Date = tail(df2$Date,1) + 1:6, Revenue = .)} %>%
arrange(desc(Revenue)) %>%
kbl('html') %>% kable_styling()
| Date | Revenue |
|---|---|
| 2022-04-26 | 280490.9 |
| 2022-04-28 | 280091.3 |
| 2022-04-30 | 279933.8 |
| 2022-05-01 | 279766.7 |
| 2022-04-29 | 279667.7 |
| 2022-04-27 | 279416.8 |
The total revenue predictions are all nearly the same because the forecasts are driven mostly by the series itself, and the effect size of Cost in the model was small relative to the units of Revenue. However, a closer inspection reveals that, holding all else constant, spending on this channel on 4/26/2022 will lead to the most Revenue.
minimum_cost = 100
while(T) {
revenue = forecast(mod, xreg = minimum_cost)$mean
if (revenue > 0) break else minimum_cost = minimum_cost + 10
}
data.frame(minimum_cost = minimum_cost, revenue = revenue)
## minimum_cost revenue
## 1 4740 1.837349
In addition, predicted revenue is negative unless costs reach at least $4,740. As such, it would be inadvisable to spend any less than $4,740 in the channel on our most advantageous day (4/26/2022).
Viewing the four scatterplots of the bivariate relationship between Cost and Revenue reveals that the relationship is approximately linear within each year. Additionally, the slope of the regression line predicting Revenue from Cost seems to vary by year, suggesting a possible interaction. To account for the historical data obtained for each month, we can also give each month a random intercept. These aspects of the data can be modeled using multilevel models, with random effects for various timing variables (e.g., year or month), in addition to the effect of Cost.
mem1 = lmer(log.Revenue ~ (1|month) + log(Cost)*year, data = df2)
mem2 = lmer(log.Revenue ~ (1|year) + log(Cost), data = df2)
summary(mem1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | month) + log(Cost) * year
## Data: df2
##
## REML criterion at convergence: 131.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2849 -0.5698 0.0497 0.6316 4.0875
##
## Random effects:
## Groups Name Variance Std.Dev.
## month (Intercept) 0.01661 0.1289
## Residual 0.06219 0.2494
## Number of obs: 980, groups: month, 13
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 2.29297 0.33859 961.70169 6.772 2.21e-11 ***
## log(Cost) 0.92784 0.04180 971.99989 22.196 < 2e-16 ***
## year2020 2.17702 0.38600 971.99229 5.640 2.23e-08 ***
## year2021 -0.41698 0.39446 971.26037 -1.057 0.291
## year2022 2.53502 0.51455 971.98987 4.927 9.83e-07 ***
## log(Cost):year2020 -0.25612 0.04740 971.61705 -5.403 8.25e-08 ***
## log(Cost):year2021 0.03524 0.04754 971.92870 0.741 0.459
## log(Cost):year2022 -0.30298 0.06101 971.96645 -4.966 8.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) lg(Cs) yr2020 yr2021 yr2022 l(C):2020 l(C):2021
## log(Cost) -0.992
## year2020 -0.818 0.809
## year2021 -0.803 0.794 0.788
## year2022 -0.640 0.639 0.570 0.580
## lg(Cs):2020 0.826 -0.822 -0.997 -0.784 -0.573
## lg(Cs):2021 0.829 -0.826 -0.793 -0.996 -0.592 0.795
## lg(Cs):2022 0.672 -0.676 -0.578 -0.584 -0.995 0.585 0.602
summary(mem2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | year) + log(Cost)
## Data: df2
##
## REML criterion at convergence: 383.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4195 -0.6190 -0.0709 0.6385 3.5354
##
## Random effects:
## Groups Name Variance Std.Dev.
## year (Intercept) 0.004903 0.07002
## Residual 0.084863 0.29131
## Number of obs: 980, groups: year, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 3.15923 0.13081 282.33818 24.15 <2e-16 ***
## log(Cost) 0.82103 0.01505 935.79499 54.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## log(Cost) -0.960
While these models do show significant effects of year and month, there is not much evidence to treat these aspects as random effects, given the low variability between months and years relative to the residual variance. This suggests that we should stick with the ARIMA model.
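To put rough numbers on that judgment, the intraclass correlations implied by the variance components printed in the summaries above can be computed directly (the values below are copied from the output; this is a back-of-envelope check, not a new model):

```r
# Variance components copied from the model summaries above
icc_month <- 0.01661 / (0.01661 + 0.06219)      # mem1: month random intercept
icc_year  <- 0.004903 / (0.004903 + 0.084863)   # mem2: year random intercept
round(c(month = icc_month, year = icc_year), 3)
```

The year intercepts account for only about 5% of the variance; the month intercepts absorb more (about 21%), though much of that month-level structure is the seasonal trend already handled by the time-series model.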
Based on the model forecasts for the week following April 25th, 2022, I would suggest that which days the money is spent on matters less than how much is spent on this channel. To that end, I would suggest spending whatever is budgeted for this week. There is minor evidence that allocating expenditures to 4/26 and 4/28 is most advantageous, followed by 4/30, 5/1, 4/29, and 4/27; however, the apparent differences in expected revenue across days may be due to noise in the model and indicative of extrapolation error, so the latter suggestion should be treated with caution.
How much Revenue do you predict that we will make in that channel for the given week with the given budget?
Based on the first model, I predict the week will return $17,007.92 plus $262,223.33 for the remaining days, totaling $279,231.25.
One important limitation is that these models were not cross-validated; therefore the predictions they generate might reflect overfitting. Since we have extensive and dense historical data, this may not be a huge concern in terms of our ultimate predictions about Revenue. However, cross-validation can be a useful tool for model comparison: with it, we could assess whether the predictions from the multilevel models are inflated relative to the ARIMAX model.
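As a sketch of what such a comparison could look like, the forecast package's tsCV() computes rolling-origin cross-validation errors. The example below uses a toy series and the built-in random-walk forecaster rwf(); the real comparison would plug in the ARIMA and multilevel setups above:

```r
library(forecast)
set.seed(1)
# Toy series standing in for Revenue
y <- ts(cumsum(rnorm(120)))
# Rolling-origin CV: 1-step-ahead errors for a random-walk forecaster
e <- tsCV(y, forecastfunction = rwf, h = 1)
rmse <- sqrt(mean(e^2, na.rm = TRUE))   # out-of-sample RMSE
rmse
```

tsCV() returns forecast errors with NAs where a forecast could not be made, hence na.rm = TRUE; comparing this RMSE across candidate models gives a cross-validated ranking.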
rm(list=ls())
library(lubridate)
library(knitr)
library(viridis)
library(lme4)
library(lmerTest)
library(kableExtra)
library(MLmetrics)
library(caret)
library(tseries)
library(forecast)
library(GGally)
library(cluster)
library(parallel)
#library(tidymodels)
suppressPackageStartupMessages(library(tidyverse))
# Custom function to convert monetary notation to floating point / double
dollar_to_numeric = function(.) {
# Assumes format: $ dollars (any number of digits) . cents (2-digits)
dollars = as.numeric(gsub('$|[[:punct:]]', '', substr(., 1, nchar(.) - 2)))
cents = substr(., nchar(.) - 1, nchar(.))
as.numeric(paste(dollars, cents, sep = '.'))
}
set.seed(1)
df <- read_tsv("../data/Question3.tsv")
print(sapply(df, class))
head(df)
df = df %>% rename_with(~gsub(' ', '.', .)) %>%
rename(N.articles = Number.of.Articles.in.Order)
Order Revenue to numeric
df = df %>% mutate(across(Order.Revenue, dollar_to_numeric))
print(setNames(dim(df), c('rows', 'columns')))
## rows columns
## 12323 5
print(paste0('Are there NaN? ', anyNA(df)))
## [1] "Are there NaN? FALSE"
psych::describe(df %>% select(where(is.numeric))) %>% kbl('html') %>% kable_styling()
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Customer.ID | 1 | 12323 | 15227.4097 | 1747.661 | 15078.0 | 15210.5686 | 2326.1994 | 12347.00 | 18283.00 | 5936.00 | 0.0758212 | -1.235251 | 15.743415 |
| Order.ID | 2 | 12323 | 558655.6037 | 12842.563 | 558419.0 | 558587.7139 | 16068.4188 | 536367.00 | 581586.00 | 45219.00 | 0.0463450 | -1.138768 | 115.689375 |
| N.articles | 3 | 12323 | 297.8672 | 623.318 | 167.0 | 195.5452 | 152.7078 | 1.00 | 15049.00 | 15048.00 | 10.3024877 | 158.897200 | 5.615022 |
| Order.Revenue | 4 | 12323 | 515.6503 | 1080.100 | 310.8 | 346.6314 | 234.4880 | 0.38 | 31698.16 | 31697.78 | 11.1844052 | 184.648797 | 9.729840 |
map(df %>% select(Customer.ID, Order.ID), n_distinct)
## $Customer.ID
## [1] 1277
##
## $Order.ID
## [1] 12323
There are 1277 customers, and a total of 12323 orders (the number of rows). This also implies there are no duplicate orders.
Revenue
ggplot(df) +
geom_density(aes(x = Order.Revenue)) +
geom_vline(aes(xintercept = median(Order.Revenue)), linetype=2) +
theme_minimal()
ggplot(df) +
geom_density(aes(x = Order.Revenue)) +
scale_x_log10() +
geom_vline(aes(xintercept = median(Order.Revenue)), linetype=2) +
theme_minimal()
The data are roughly normal after log-transforming Revenue, but there is an overabundance of orders with Revenue values near the median. Hereafter, log Revenue is used instead of raw Revenue.
In order to classify a new customer as a valuable customer, it may be worthwhile to define “valuable” based on the behaviors of existing customers. As a result, I will engineer variables to summarize each customer's behavior across their interactions with the company, for example the median Revenue per article, the total number of articles, and the frequency with which they place orders.
beh = df %>%
mutate(Order.Revenue = log(Order.Revenue)) %>%
group_by(Customer.ID) %>%
summarise(
total_revenue = sum(Order.Revenue), # sum of all revenue from customer
n_orders = n(), # number of orders in total
# how recent was the last order?
recency = as.numeric((max(df$Order.Date) + 1) - max(Order.Date)),
# median order revenue, per order
median.rev_per_order = median(Order.Revenue),
# price of article, per order
rev_per_art_per_order = median(Order.Revenue / N.articles),
# sum total number of articles ordered
total.articles = sum(N.articles),
# the number of articles per order (median):
median.art_per_ord = median(N.articles),
# this variable captures how long the customer's orders span in time
longevity = as.numeric(max(Order.Date) - min(Order.Date)),
# this counts how much time lags between orders
Consistency = as.numeric(mean(diff(Order.Date)))
)
### Plot relationships between variables
p = ggpairs(beh %>% select(-Customer.ID))
ggsave(plot=p + theme(text = element_text(size = 8)),
filename = "pairs.png", units = 'px', width = 2*1400, height = 2*700)
Viewing the correlations and distributions of this multivariate data can help us understand how to model it, and which features might be useful for predicting revenue.
Conclusions:
To run kmeans, the data should be normalized. Since several variables are skewed, and kmeans uses means to quantify the distance between customers and cluster centroids, the data are best transformed prior to clustering.
Since there are some examples of multicollinearity in the data, and since several variables have been engineered out of just a couple of columns, PCA is a wise choice for both dimension reduction and normalization.
# get the data formatted, log transform skewed variables
cluster_dat = beh %>% select(-Customer.ID, -longevity) %>%
mutate(across(c(total_revenue, n_orders, recency, rev_per_art_per_order,
total.articles, median.art_per_ord),
log)) %>%
as.matrix
# Fit pca model
pca = prcomp(cluster_dat, center = T, scale. = T, retx=T)
# Fit the cluster model
cluster_model = kmeans(x = pca$x[,1:3], centers = 4)
# get the distance matrix to assess cluster fit
D = dist(pca$x[,1:3], 'euc') # distance matrix
mean(silhouette(cluster_model$cluster, D)[, 'sil_width'])
## [1] 0.2758473
4 clusters seems like a reasonable solution.
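As a check on that choice, one could compare the average silhouette width across several candidate values of k. The sketch below applies the same silhouette criterion to simulated three-group data standing in for the PCA scores (the data and group structure here are made up for illustration):

```r
library(cluster)
set.seed(1)
# Toy stand-in for the 3-D PCA scores: three well-separated groups
x <- rbind(matrix(rnorm(150, mean = 0), ncol = 3),
           matrix(rnorm(150, mean = 4), ncol = 3),
           matrix(rnorm(150, mean = 8), ncol = 3))
D <- dist(x)
# Average silhouette width for k = 2..6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, D)[, "sil_width"])
})
setNames(round(avg_sil, 2), paste0("k=", 2:6))
```

On the simulated data the criterion peaks at k = 3, the true number of groups; run on the real PCA scores, the same loop would show how k = 4 compares to its neighbors.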
test = read.table(
textConnection("
17811 540566 2011-01-10 83 $185.97
17811 541648 2011-01-20 166 $337.70
17811 542149 2011-01-26 120 $108.55
17811 542929 2011-02-02 105 $289.30
17811 543801 2011-02-13 52 $210.70
17811 543839 2011-02-14 30 $103.70
17811 544450 2011-02-20 497 $1,271.55
17811 544642 2011-02-22 50 $139.80
17811 545279 2011-03-01 53 $122.50
17811 545700 2011-03-06 150 $166.45
17811 546524 2011-03-14 45 $110.84
17811 547197 2011-03-21 30 $108.42
17811 549048 2011-04-06 42 $113.10
17811 549830 2011-04-12 57 $117.83
17811 551870 2011-05-04 132 $194.30
17811 553423 2011-05-17 90 $187.42
17811 554362 2011-05-24 43 $150.35
17811 555548 2011-06-05 99 $147.05
17811 558107 2011-06-26 75 $124.76
17811 559671 2011-07-11 55 $114.95
17811 560820 2011-07-21 127 $146.66
17811 563062 2011-08-11 70 $120.42
17811 563843 2011-08-19 57 $108.94
17811 566042 2011-09-08 90 $182.90"), sep = ' '
) %>%
set_names(names(df)) %>%
mutate(across(Order.Revenue, dollar_to_numeric),
across(Order.Date, as.Date))
test.beh = test %>%
mutate(Order.Revenue = log(Order.Revenue)) %>%
group_by(Customer.ID) %>%
summarise(
total_revenue = sum(Order.Revenue),
n_orders = n(),
recency = as.numeric((max(df$Order.Date) + 1) - max(Order.Date)),
median.rev_per_order = median(Order.Revenue),
rev_per_art_per_order = median(Order.Revenue / N.articles),
total.articles = sum(N.articles),
median.art_per_ord = median(N.articles),
longevity = as.numeric(max(Order.Date) - min(Order.Date)),
Consistency = as.numeric(mean(diff(Order.Date)))
)
test.cluster_dat = test.beh %>%
select(-Customer.ID, -longevity) %>%
mutate(across(c(total_revenue, n_orders, recency, rev_per_art_per_order,
total.articles, median.art_per_ord),
log)) %>%
as.matrix
pca.test = predict(pca, newdata = test.cluster_dat)[, 1:3]
Compute the mean squared error (MSE) between the new customer’s PCA scores and the centroid of each cluster. We assign the customer to the cluster where the MSE is minimized.
centroids = cluster_model$centers
for (i in 1:nrow(centroids)) {
print(MSE(centroids[i, ], pca.test))
}
## [1] 2.413387
## [1] 7.585298
## [1] 7.3317
## [1] 4.872944
This customer would be classified into cluster 1.
# Get the revenue predictions from the untransformed data
df %>%
filter(Customer.ID %in% beh$Customer.ID[cluster_model$cluster ==1]) %>%
group_by(Customer.ID) %>%
summarise(total_revenue = sum(Order.Revenue)) %>%
summarise(difference = median(total_revenue) - sum(test$Order.Revenue))
## # A tibble: 1 × 1
## difference
## <dbl>
## 1 -2532.
Unfortunately, the predicted Revenue for the customer is already less than what they have generated so far in the test data. The model is likely ill-fitting.
A sensible second approach would be to take some of the features extracted from the past order history and fit regression models.
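For illustration, a minimal sketch of that approach on simulated stand-in data (the variable names mirror the engineered features in beh, but the data-generating process and coefficients here are made up):

```r
set.seed(1)
n <- 500
# Simulated stand-in for the engineered customer features in `beh`
n_orders <- rpois(n, 10) + 1
recency  <- rexp(n, rate = 1 / 30)
# Assume log revenue rises with order count and falls with recency
total_revenue <- exp(6 + 0.8 * log(n_orders) - 0.01 * recency +
                     rnorm(n, sd = 0.5))
# Regress log total revenue on the behavioral features
fit <- lm(log(total_revenue) ~ log(n_orders) + recency)
coef(fit)
```

On the real beh table, the fitted model would yield a continuous revenue prediction for a new customer rather than a cluster assignment, and could be compared to the cluster-based prediction via held-out error.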
Describe a problem you’ve observed in online shopping or online advertising and how you would approach solving this problem, from a data science perspective.
One problem I’ve observed in online shopping is in knowing what information a customer is absorbing when viewing a webpage with an array of products and information, that ultimately leads them to their next page, destination, or purchase. Attention and perception are tightly coupled with eye gaze location, but this information is unavailable to companies and ad servers. To make matters worse, users may be employing short-term memory to return to pages that have nothing to do with the ones they’ve viewed most recently. Despite these challenges, I propose that semantic network analysis could provide information to accurately predict products that a user is interested in, increasing the ability to predict whether a subsequently served ad will be relevant and clicked.
The basic assumption is that users are not only searching and internally filtering through ads with an internal model of what is relevant and irrelevant; they are also learning about what is available from viewing other products and ads. By tracking the attributes of all ads and products on a page, we don’t need explicit knowledge of whether the user clicked an ad to determine whether they looked at it, and even mentally engaged with it. If the subsequent pages they attend to are more semantically related to the content of some ads or products on previous pages (or on multiple of the previous pages visited during shopping), then we can infer that they attended to some of those semantically related ads. Because short-term memory can’t possibly maintain all of this information, it is likely that subsequent pages will be related to previous ones and will sequentially become more relevant to the user’s goals/interests. Finally, at this point, when information has been accumulated about the types of product attributes that are relevant to the user at that time, we can serve an ad that is semantically related, and therefore relevant, with a higher propensity of leading to a click or purchase.
To solve this problem, we can measure the attributes, or semantic content, of ads and products. This can include broad categorical information, such as ‘is a car’, as well as more specific information, such as ‘runs on gas’. Building a semantic model then enables us to determine which of the features shared between subsequently shown ads and products were actually attended to, because those features will remain a common thread throughout the multiple page impressions the user makes during shopping. Therefore, to use this information to infer the relevant goals of a user and to anticipate a useful ad, we need only keep track of, and continually update, a semantic network relating the ads and products shown on consecutive pages across the user’s journey, and compute the similarity (in real time) between the prominent modules in that network and the available ads to serve.
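To make the last step concrete, here is a minimal sketch, with entirely made-up attribute vectors, of scoring candidate ads by cosine similarity against a running profile of attended product attributes:

```r
# Cosine similarity between two attribute vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
# Hypothetical attributes: c(is_car, runs_on_gas, is_phone, has_camera)
user_profile <- c(0.9, 0.7, 0.1, 0.0)   # accumulated from pages the user viewed
candidate_ads <- rbind(
  suv_ad   = c(1, 1, 0, 0),
  phone_ad = c(0, 0, 1, 1),
  ev_ad    = c(1, 0, 0, 0)
)
# Score each candidate ad against the profile and pick the best match
scores <- apply(candidate_ads, 1, cosine, b = user_profile)
names(which.max(scores))   # serve the most semantically related ad
```

In practice the profile vector would be updated continuously as the user moves between pages, and the attribute space would come from the semantic model rather than hand-coded flags.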